feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) #760
Open
goel-skd wants to merge 2 commits into
Conversation
bec8884 to
69cc006
Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.
69cc006 to
f42e2da
wgtmac
reviewed
Jun 19, 2026
target_include_directories(utf8proc::utf8proc INTERFACE ${utf8proc_SOURCE_DIR})
endif()

set(UTF8PROC_VENDORED TRUE)
Member
utf8proc is licensed under the permissive MIT License (along with some Unicode data under a similarly permissive license). We need to update the LICENSE file by adding a new section at the bottom, set off by a separator, similar to this:
---
This product bundles utf8proc, which is available under the MIT License:
Copyright © 2014-2021 by Steven G. Johnson, Jiahao Chen, Tony Kelman, Jonas Fonseca, and other contributors listed in the git history.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software... (include the rest of the utf8proc MIT license text here)
Contributor
Author
Thanks @wgtmac. Good catch, let me update it.
Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT).

ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is returned unchanged rather than erroring.
EqualsIgnoreCase now compares the lowercased forms of both inputs, so it is case-insensitive for non-ASCII letters too.
ToUpper is intentionally left ASCII-only, since it is not used for name matching.
utf8proc is wired into both the CMake (vendored via FetchContent / system package) and Meson (subprojects/utf8proc.wrap) builds.

Testing

Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly, and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8 pass-through).
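
For readers who want to see the described behaviour concretely, here is a minimal, self-contained sketch of the decode, simple-case-map, re-encode loop with invalid-UTF-8 pass-through. It is not the code in this PR: `SimpleLower` is a hypothetical stand-in that covers only ASCII, the Latin-1 capitals, and U+1E9E, whereas the PR delegates the mapping to utf8proc (whose `utf8proc_tolower` would presumably fill that role). `DecodeOne` and `AppendUtf8` are likewise illustrative helpers, and the free functions only mirror the names of the `StringUtils` members.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a simple (1:1) lowercase lookup; NOT utf8proc.
// Covers ASCII A-Z, Latin-1 capitals (except the multiplication sign), U+1E9E.
inline char32_t SimpleLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
  if (cp == 0x1E9E) return 0xDF;  // capital sharp s -> small sharp s
  return cp;
}

// Decodes one code point at s[i] and advances i; nullopt on malformed input
// (bad lead/continuation byte, truncation, overlong form, surrogate, > U+10FFFF).
inline std::optional<char32_t> DecodeOne(std::string_view s, std::size_t& i) {
  const auto b0 = static_cast<unsigned char>(s[i]);
  std::size_t len;
  char32_t cp;
  char32_t min;
  if (b0 < 0x80) { len = 1; cp = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return std::nullopt;
  if (i + len > s.size()) return std::nullopt;
  for (std::size_t k = 1; k < len; ++k) {
    const auto b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return std::nullopt;
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return std::nullopt;
  }
  i += len;
  return cp;
}

// Appends the UTF-8 encoding of a valid code point to out.
inline void AppendUtf8(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Lower-cases code point by code point; any invalid UTF-8 returns the
// whole input unchanged rather than reporting an error.
inline std::string ToLower(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size();) {
    const auto cp = DecodeOne(s, i);
    if (!cp) return std::string(s);
    AppendUtf8(SimpleLower(*cp), out);
  }
  return out;
}

// Compares the lowercased forms of both inputs.
inline bool EqualsIgnoreCase(std::string_view a, std::string_view b) {
  return ToLower(a) == ToLower(b);
}
```

Because the mapping is 1:1 per code point, the output can differ in byte length from the input (U+1E9E takes three bytes, U+00DF two), which is why the sketch builds a new string instead of lower-casing in place.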
Closes #613.
Follow-up to #748